Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers

Authors

  • D. Sculley
  • Gabriel Wachman
  • Carla E. Brodley
Abstract

Contemporary spammers commonly seek to defeat statistical spam filters through the use of word obfuscation. Such methods include character level substitutions, repetitions, and insertions to reduce the effectiveness of word-based features. We present an efficient method for combating obfuscation through the use of inexact string matching kernels, which were first developed to measure similarity among mutating genes in computational biology. Our system avoids the high classification costs associated with these kernel methods by working in an explicit feature space, and employs the Perceptron Algorithm using Margins for fast on-line training. No prior domain knowledge was incorporated into this system. We report strong experimental results on the TREC 2006 spam data sets and on other publicly available spam data, including near-perfect performance on the TREC 2006 Chinese spam data set. These results invite further exploration of the use of inexact string matching for spam filtering.
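The two ingredients named in the abstract, an explicit feature space for inexact string matching and a margin-based perceptron trained on-line, can be illustrated with a minimal sketch. The abstract does not say which inexact matching kernel or which parameters the authors used, so everything specific here is an assumption made for illustration: the wildcard k-mer mapping, the names `wildcard_kmers` and `MarginPerceptron`, and the values of `k`, `max_wild`, `margin`, and `rate`.

```python
from collections import Counter
from itertools import combinations


def wildcard_kmers(text, k=4, max_wild=1, wildcard="*"):
    """Map text to a sparse bag of k-mers plus wildcarded variants.

    Hypothetical feature mapping: each k-mer also emits copies in which
    up to `max_wild` positions are replaced by `wildcard`, so strings
    that differ by a character substitution still share some features.
    """
    feats = Counter()
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        feats[kmer] += 1
        for n in range(1, max_wild + 1):
            for positions in combinations(range(k), n):
                chars = list(kmer)
                for p in positions:
                    chars[p] = wildcard
                feats["".join(chars)] += 1
    return feats


class MarginPerceptron:
    """Sparse on-line linear classifier with a margin-based update rule."""

    def __init__(self, margin=1.0, rate=1.0):
        self.w = {}           # sparse weight vector: feature -> weight
        self.margin = margin  # assumed threshold; not taken from the paper
        self.rate = rate      # assumed learning rate

    def score(self, feats):
        # Dot product between the sparse weights and the sparse features.
        return sum(self.w.get(f, 0.0) * v for f, v in feats.items())

    def update(self, feats, label):
        # label is +1 for spam, -1 for ham. Update whenever the example
        # is misclassified or falls inside the margin.
        if label * self.score(feats) <= self.margin:
            for f, v in feats.items():
                self.w[f] = self.w.get(f, 0.0) + self.rate * label * v
```

Under this mapping, "viagra" and the obfuscated "v1agra" both produce the feature `v*ag`, so a weight learned from the clean spelling still contributes to the score of the obfuscated one. Because the features are stored explicitly, scoring a message is a single sparse dot product rather than a sum of kernel evaluations against stored examples.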

Similar resources

Spam Filtering through Multiple Pattern Bit Parallel String Matching Combining Shift AND and OR

Spam refers to unsolicited, unwanted, and inappropriate bulk email. Spam filtering has become prominent because spam consumes a lot of network bandwidth, overloads email servers, and reduces the productivity of the global economy. Content-based spam filtering is accomplished with the help of multiple-pattern string matching algorithms. Traditionally, the Aho-Corasick algorithm was used to filter spam which co...


A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization

Spam is unwanted email that is harmful to communications around the world. Spam is a growing problem for personal email, so detecting it is essential. Machine learning is well suited to this problem because its adaptive nature lets it learn the patterns required for classification. Nonetheless, in spam detection, there are a large num...


Filtering Obfuscated Email Spam by means of Phonetic String Matching

Rule-based email filters mainly rely on the occurrence of critical words to classify spam messages. However, perceptive obfuscation techniques can be used to elude exact pattern matching. In this paper we propose a new technique for filtering obfuscated email spam that performs approximate pattern matching both on the original message and on its phonetic transcription.


Active Multi-Field Learning for Spam Filtering

Ubiquitous spam messages cause a serious waste of time and resources. This paper addresses the practical spam filtering problem and proposes a universal approach to fighting various kinds of spam messages. The proposed active multi-field learning approach is based on two observations: 1) obtaining a label for a real-world spam filter is costly, which suggests an active learning idea; and 2) different messag...


Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification

Nowadays, malicious URLs are a common threat to businesses, social networks, net-banking, etc. Existing approaches have focused on binary detection, i.e., whether a URL is malicious or benign. Very little literature addresses the detection of malicious URLs together with their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...



Publication date: 2006